



Example where multi-step outperforms one-step

Neural Information Processing Systems

As explained in the main text, this section presents an example that is only a slight modification of the one in Figure 4, but where a multi-step approach is clearly preferred over just one step. The data-generating and learning processes are exactly the same (100 trajectories of length 100, discount 0.9, α = 0.1 for reverse KL regularization). The only difference is that rather than using a behavior that is a mixture of optimal and uniform, we use a behavior that is a mixture of maximally suboptimal and uniform. If we call the suboptimal policy π (which always goes down and left in our gridworld), then the behavior for the modified example is β = 0.2 π + 0.8 u, where u is uniform. Results are shown in Figure 7.

Figure 7: A gridworld example with modified behavior where multi-step is much better than one-step.
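The mixed behavior policy, 0.2 times the suboptimal policy plus 0.8 times the uniform policy, can be illustrated with a minimal sketch. Several details here are assumptions that the text does not state: the four-action set, the reading of "always goes down and left" as an even split between those two actions, and a state-independent policy (in the actual gridworld the mixture would be formed per state). The helper names `mixture_policy` and `sample_actions` are hypothetical.

```python
import random

# Assumed action set for a gridworld; the source does not list the actions.
ACTIONS = ["up", "down", "left", "right"]


def mixture_policy(weight, policy, n_actions):
    """Return weight * policy + (1 - weight) * uniform as a list of probabilities."""
    uniform = 1.0 / n_actions
    return [weight * p + (1.0 - weight) * uniform for p in policy]


# Assumption: the suboptimal policy splits its mass evenly between down and left.
pi_sub = [0.0, 0.5, 0.5, 0.0]

# Behavior policy: 0.2 * suboptimal + 0.8 * uniform.
beta = mixture_policy(0.2, pi_sub, len(ACTIONS))


def sample_actions(probs, length, rng):
    """Draw `length` actions independently from the given action distribution."""
    return rng.choices(ACTIONS, weights=probs, k=length)


rng = random.Random(0)  # fixed seed so the draw is reproducible
trajectory_actions = sample_actions(beta, 100, rng)  # one trajectory of length 100
```

Under these assumptions the behavior puts roughly 0.3 probability on each of down and left and roughly 0.2 on each of up and right, so most of the data still comes from uniform exploration.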



Appendix A Proof of Theorem 2.1

Neural Information Processing Systems

We have the following lemma. Using the notation of Lemma A.1, we have E The third inequality uses the Lipschitz assumption of the loss function. Figure 10 supplements 'Relation to disagreement' at the end of Section 2. It shows an example where the behavior of inconsistency differs from that of disagreement. All the experiments were done using GPUs (A100 or older). The goal of the experiments reported in Section 3.1 was to find whether/how the predictiveness of The arrows indicate the direction of training becoming longer.




Video Prediction via Selective Sampling

Neural Information Processing Systems

This module is trained in an adversarial learning manner [5]. The Selection module selects high-possibility candidates from the proposals and combines them to produce the final prediction, according to the criterion of better position matching.